\chapter{Conclusions}
\begin{quotation}
\noindent ``\emph{By far the greatest danger of artificial intelligence
	is that people conclude too early that they understand it.}''
\begin{flushright}\textbf{Eliezer Yudkowsky}\end{flushright}
\end{quotation}
\vspace*{0.5cm}

After having reviewed the foundations of artificial neural networks and
reinforcement learning, we went through the reasons why meta-learning is
interesting and useful: not only it allows one to avoid choosing, designing
or parametrising complex task-related strategies to accelerate training, but
most importantly, it allows for agents to learn highly efficiently problems
that have the same structure and to generalise over the parameters of the 
structure -- bluntly put, one can reuse an agent without retraining it, as long
as the problem is similar and the meta-learning agent has seen enough
variation in the training distribution.\\

We have reproduced one experiment originally setup by Wang et al. 
\cite{learningtorl} and Duan et al. \cite{fastrlviaslowrl} which consisted
in meta-learning dependent 2-armed bandit problems. While we achieved the
same results as the two seminal papers cited above, we extended the results
discussion to the dynamics of meta-learning such a problem, showing how
the meta-learning agent develops its strategy to learn a bandit problem faster
and faster. We also studied its generalisation capability for the whole
range of parameters of this problem.\\

Meta-learning was extended to a new set of experiments based on the CartPole
environment. We designed a distribution of CartPole problems by shuffling
the state observation received by the agent (without informing the agent
of the mapping between values and their meaning), and by permutating 
the agent's actions randomly at the start of training trials.\\

We have shown that even though, surprisingly, the agent was able to learn
to discover which problem it was playing in one episode, meaning that:
\begin{enumerate}
	\item it was able to make sense of an unordered set of values in input;
	\item it was able to learn the consequences of its actions;
	\item it could perform 1) and 2) in time to be able to balance the
		pole for a long enough time to succeed the episode.
\end{enumerate}
Although the performance of the agent was near optimal, we found that letting
the agent play for at least two episodes increased its performance as it
was able to learn about the environment and take consequential action in
later episodes to improve its success rate. Quite surprisingly also,
the meta-learning agent performed better in environments which were more
difficult, and the difference between single-episode and multi-episode trials
increased as the problems grew harder and both were tested on previously unseen
problems.\\

There are still parameters to choose manually when designing a meta-learning
agent, and their tuning provides for some interesting dynamics in the 
performance of a trained agent. We have studied how the discount factor
and the number of episodes per trial
played an important role in the evolution of episode-wise reward during the 
training of a meta-learning agent.
We have also tried to understand how a meta-learning agent handled playing
more episodes than what it had been trained for.\\

This deep look into the workings of our agent and its reaction to different
hyperparameters helped us understand an inherent flaw in the reward structure
of the CartPole problem forbidding the agent to perform at its best for
as many episodes as possible, as soon as it could. Indeed, a continuous
stream of rewards of +1 at each timestep drowned the information of "lost
episodes" to the meta-learning agent.\\

This discovery led us to propose a simple way to inject an informative reward
at the end of each episode to allow the agent to play at its best performance
as soon as it could, learning and passing information on to following episodes
so that they could in turn perform even better. After having successfully
trained an agent to balance the pole in CartPole with this informative reward,
we tested it on the novel, unseen task of failing at what it had been trained
to do by starting to give it positive reward only after its failure. To
our surprise, the agent managed to learn to fail within the first episode
of its trials, later taking action to fail its episodes as quickly as it 
could. This new problem was sucessfully learned only within \textbf{one} episode
without having to be retrained.\\

\section{Future work}
There is still much to explore in meta-learning. For instance, could a 
meta-learning agent learn to play problems generated from different environment?
What would be the similarity constraints for it to work? If the agent does
manage to learn to perform well in different environments, could it then
learn to play in a totally unseen environment? Once again, in that case,
what would be the requirements in terms of similarity to ensure success?\\

Furthermore, could environments with fundamental differences (different 
dimensionalities of the state and action sets) be learned by the same agent?\\

We are very confident in the multi-tasking capabilities of the meta-learning
agent, and its ability to add an interpretation level between itself and the 
state, both input-wise (when it receives observations) and output-wise (when
it performs actions); and hope that there will be a strong interest in 
researching the limits of its performance.\\

We suspect that adapting the architecture presented by Wang et al. to allow
the outer algorithm to access more information that is implicitly or
explicitly available to strategy designers (e.g. have knowledge of total 
episode rewards instead of timestep-only rewards) could lead to an overall 
better learning algorithm.\\


\section{Source code}
The source code for our agent, and also the code used to run all the
experiments of this work has written in Python 3 \cite{python3} using
Tensorflow \cite{tensorflow}. It is available at the following url:
\url{https://github.com/fhennecker/meta-reinforcement-learning}.
